Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking

Liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics measures both known and unknown metabolites in the metabolome. However, annotating unknown metabolites remains a major challenge in untargeted metabolomics. Here, we develop an approach, the knowledge-guided multi-layer network (KGMN), to enable global metabolite annotation from knowns to unknowns in untargeted metabolomics. The KGMN approach integrates three network layers: a knowledge-based metabolic reaction network, a knowledge-guided MS/MS similarity network, and a global peak correlation network. To demonstrate the principle, we apply KGMN to an in vitro enzymatic reaction system and to different biological samples, with ~100–300 putative unknowns annotated in each data set. Among them, >80% of the unknown metabolites are corroborated with in silico MS/MS tools. Finally, we validate 5 metabolites that are absent from common MS/MS libraries through repository mining and synthesis of chemical standards. Together, the KGMN approach enables efficient annotation of unknowns and substantially advances the discovery of recurrent unknown metabolites in common biological samples from model organisms, towards deciphering the dark matter in untargeted metabolomics.


Tutorial of KGMN result visualization and analysis
Zhiwei Zhou

2022-06-05
Introduction
Unknown metabolite annotation is one of the long-standing challenges in untargeted metabolomics. We develop an approach, the knowledge-guided multi-layer network (KGMN), to enable global metabolite annotation from knowns to unknowns in untargeted metabolomics. The KGMN approach integrates three network layers: a knowledge-based metabolic reaction network (Network 1), a knowledge-guided MS/MS similarity network (Network 2), and a global peak correlation network (Network 3). This tutorial helps users visualize, reproduce, and investigate the putatively annotated known and unknown metabolites from KGMN.

Installation
The analysis and visualization of KGMN results rely mainly on the R package MetDNA2Vis and the R packages it depends on. Cytoscape is used to visualize the networks manually and to investigate KGMN results interactively. ChemDraw is used to draw chemical structures.

Download demo data and unzip the archive.
• All required intermediate files for visualization are provided in the '06_visualization' folder.

Network 1
Network 1 is the knowledge-guided metabolic reaction network. For knowns, the KEGG reaction pair network is used directly. For unknowns, an extended KEGG reaction pair network is used. The network is expanded with in-silico enzymatic reactions (via BioTransformer), and the resulting products are connected to the KEGG reaction pair network. The details of network construction and expansion are described in our KGMN manuscript. Note that the KEGG reaction pair network and the extended network are built in advance.
To export network 1, run the reconstructNetwork1 function as below:
# export network 1 for visualization
The network files will be exported to the '00_network1' folder. It contains two files, "edge_table.tsv" and "node_table.tsv" (Figure 2.3.1). These tables can be imported into Cytoscape for visualization.
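Before importing the exported tables into Cytoscape, it can be useful to check them quickly. The following is a minimal sketch in Python (not part of MetDNA2Vis). It assumes the edge table is tab-separated with 'from' and 'to' columns, which are the column names used in the Cytoscape import step; the sample rows and node names are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the contents of "00_network1/edge_table.tsv".
# Only the 'from' and 'to' column names come from the tutorial; rows are invented.
SAMPLE_EDGE_TABLE = "from\tto\ncpd_A\tcpd_B\ncpd_B\tcpd_C\ncpd_A\tcpd_C\n"


def read_edges(handle):
    """Return a list of (source, target) pairs from a tab-separated edge table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["from"], row["to"]) for row in reader]


def node_degrees(edges):
    """Count how many edges touch each node, ignoring edge direction."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


edges = read_edges(io.StringIO(SAMPLE_EDGE_TABLE))
degrees = node_degrees(edges)
print(f"{len(edges)} edges, {len(degrees)} nodes")
```

To inspect a real export, the in-memory sample can be replaced with a file handle such as open("00_network1/edge_table.tsv", newline=""), assuming the file layout matches the assumption above.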

Network 2
The network files will be exported to the '01_network2' folder. The "edge_table.tsv" and "node_table.tsv" files in this folder can be imported into Cytoscape.

Network 3
Network 3 is the global peak correlation network. This network recognizes the different ion forms of peaks in network 2, including adducts, isotopes, neutral losses, and in-source fragments (ISF). Network 3 is used to optimize the annotations and linkages of network 2. This optimization has already been completed during the KGMN analysis. The details of network 3 construction and optimization can be found in our manuscript.
To export network 3, run the reconstructNetwork3 function as below:
# export network3
reconstructNetwork3()
The network files will be exported to the '02_files_network3' folder. The "edge_table.tsv" and "node_table.tsv" files in this folder can be imported into Cytoscape for visualization.
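To illustrate the idea behind recognizing ion forms, the sketch below (Python, not the KGMN implementation) labels co-eluting peaks whose m/z differs from a base [M+H]+ peak by a known mass shift. The mass shifts are standard values rounded to five decimals; the tolerances, function names, and peak values are assumptions made for illustration only.

```python
# Mass differences (Da) relative to an [M+H]+ peak. These are standard values,
# not parameters taken from the KGMN code.
ION_FORM_SHIFTS = {
    "[M+Na]+": 21.98194,       # Na+ in place of H+
    "[M+NH4]+": 17.02655,      # NH4+ in place of H+
    "[13C] isotope": 1.00336,  # one 13C in place of 12C
}


def within_ppm(observed, expected, ppm):
    """True if observed m/z lies within the ppm tolerance of the expected m/z."""
    return abs(observed - expected) <= expected * ppm / 1e6


def label_ion_forms(base_mz, base_rt, peaks, ppm=10.0, rt_tol=3.0):
    """Label peaks that co-elute with the base peak and match a known mass shift.

    peaks is a list of (mz, rt) tuples; rt and rt_tol are in seconds.
    """
    labels = []
    for mz, rt in peaks:
        if abs(rt - base_rt) > rt_tol:
            continue  # different retention time, so not the same compound
        for name, shift in ION_FORM_SHIFTS.items():
            if within_ppm(mz, base_mz + shift, ppm):
                labels.append((mz, rt, name))
    return labels


# Invented example values: a base peak at m/z 262.1000 and RT 526 s.
candidate_peaks = [
    (284.08194, 526.5),  # matches the sodium adduct shift
    (263.10336, 525.8),  # matches the 13C isotope shift
    (284.08194, 300.0),  # right m/z, wrong retention time
    (270.50000, 526.0),  # no matching shift
]
labels = label_ion_forms(262.1000, 526.0, candidate_peaks)
for mz, rt, name in labels:
    print(f"m/z {mz:.4f} at {rt:.1f} s -> {name}")
```

A real peak correlation network would also consider neutral losses, in-source fragments, and peak-shape correlation, which this sketch leaves out.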

Visualize global networks with Cytoscape
Networks 1-3 above can be imported into Cytoscape for visualization. The procedure is similar for each network, so we use network 1 as a demonstration. The Cytoscape version used here is 3.8.2.
Below are the step-by-step instructions:
1. Import the edge file. Select the "edge_table.tsv" file and open it in the dialog box.
2. Assign column attributes. Click the "from" column and set it as "source node". Similarly, click the "to" column and set it as "target node". After assigning the attributes, click OK to construct the network.
To reproduce our plots quickly, users can directly import our style file. The styles of the different networks are provided here (https://mega.nz/file/tnp1nKjT#LS1oPzcFzw6bbdsLSqGoW4Qggrl_lM2LsPgsyZXilzQ).

The network files will be exported to the '03_subnetworks/your_defined_folder/network 1' folder. Here, the exported folder is "M182T541_M262T526". The "edge_table.tsv" and "node_table.tsv" files in this folder can be imported into Cytoscape for visualization. Note: if you run the code in RStudio, a preview plot of subnetwork 1 is shown directly in the plot panel.
Similarly, network 2 and network 3 of this subnetwork can be exported by running the retrieveSubNetwork2 and retrieveSubNetwork3 functions, respectively. Preview plots of subnetwork 2 and subnetwork 3 are shown in the plot panel if you run the code in RStudio. The 'network_merge' folder contains the node table and edge table needed to reproduce the merged network.
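One simple way to see how two peaks are linked in an exported edge table is a shortest-path search. The sketch below is in Python and is not how the retrieveSubNetwork functions work internally; the edge list is invented, and only the peak names M182T541 and M262T526 come from the example folder name.

```python
from collections import defaultdict, deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected edge list.

    Returns the list of nodes from start to goal, or None if they are not connected.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbors[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Invented edges; M198T530 and M150T400 are hypothetical intermediate peaks.
example_edges = [
    ("M182T541", "M198T530"),
    ("M198T530", "M262T526"),
    ("M182T541", "M150T400"),
]
path = shortest_path(example_edges, "M182T541", "M262T526")
print(" -> ".join(path))
```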

The script for visualization
Here is a script containing the code above to help reproduce the analysis quickly. The step-by-step instructions are provided below.

Data preparation.
In this workflow, the data files must first be processed with KGMN (MetDNA2). Here, we use the NIST human urine data set as an example. The data set has been analyzed with KGMN (v1.0.4), and the results can be downloaded here (https://mega.nz/file/8v50iL6T#oILf8wlVJU_iqTfjcOtH1TRHhnP1GGbvG_ZNb1xniGc).
The folders should look like the following:

Users can browse the annotation table "table1_identification.csv" in the "00_annotation_table" folder and select known or unknown peaks of interest. Note that the choice of target peak is up to the user.
For demonstration, we use the unknown peak M262T526 as an example (Figure 5d in the manuscript).
The MS/MS spectrum of this peak can be found in the "ms2_data.msp" file in the "06_visualization" folder.
You can open it with a text editor (e.g., Notepad++).

Upload and analysis in MASST.
Users can submit this spectrum to MASST (https://gnps.ucsd.edu/ProteoSAFe/static/gnpssplash.jsp?redirect=auth) to perform repository mining. Users need to log in first. Then, click the "query spectrum" button in the MASST panel to start the analysis. Copy the corresponding text from the MSP file into the "title", "precursor m/z", and "spectrum input" fields of the web server, respectively.
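The copy step can also be prepared programmatically. The sketch below (Python) parses one MSP record and prints the three pieces of text to paste. The field names NAME and PRECURSORMZ follow common MSP conventions and are assumptions about "ms2_data.msp"; the spectrum values are invented.

```python
# Invented example record; the field names are assumed, not taken from the KGMN output.
SAMPLE_MSP = """\
NAME: M262T526
PRECURSORMZ: 262.1000
Num Peaks: 3
84.0444 120
130.0499 999
262.1000 350
"""


def parse_msp_record(text):
    """Parse the first record of an MSP-formatted string.

    Returns (fields, peaks): fields maps upper-cased keys to string values,
    and peaks is a list of (mz, intensity) tuples.
    """
    fields = {}
    peaks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            break  # a blank line ends the record
        if ":" in line and not line[0].isdigit():
            key, value = line.split(":", 1)
            fields[key.strip().upper()] = value.strip()
        else:
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))
    return fields, peaks


fields, peaks = parse_msp_record(SAMPLE_MSP)
spectrum_text = "\n".join(f"{mz:.4f}\t{intensity:.0f}" for mz, intensity in peaks)

print("title:", fields["NAME"])
print("precursor m/z:", fields["PRECURSORMZ"])
print("spectrum input:")
print(spectrum_text)
```

If the real file uses different field names, or contains several records, the key lookups and the record-splitting logic would need to be adjusted accordingly.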

Modify the search parameters and click the "submit" button. The parameters used in the KGMN manuscript are provided below.
When the job has finished, you will receive an email with a link. You can view and download the results on the web server.

Result interpretation and visualization.
The downloaded results include 2 ZIP files, "view_all_datasets_matched.zip" and "view_all_file_datasets_matched.zip". The files in these archives can be opened with Microsoft Excel or other tools (e.g., R, Python).
The "view_all_datasets_matched" table contains meta information on the data sets in which the spectrum appeared, such as "dataset description", "dataset id", "dataset organisms", and "files count".
Furthermore, the species and sample information can be inferred from the dataset description. In our example, the spectrum appeared in 7 datasets and 3 organisms (the entry labeled genipapo is actually from human urine according to the dataset description).
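Counting datasets and organisms can be done with a short script. The sketch below (Python) assumes a tab-separated table whose headers match the field names quoted above; the actual headers and delimiter in the downloaded file may differ, and the rows shown are invented.

```python
import csv
import io

# Invented rows; the column names are assumed to match the fields named in the text.
SAMPLE_TABLE = (
    "dataset id\tdataset organisms\tfiles count\tdataset description\n"
    "DATASET_A\tHomo sapiens\t12\tHypothetical urine study\n"
    "DATASET_B\tHomo sapiens\t3\tHypothetical plasma study\n"
    "DATASET_C\tMus musculus\t5\tHypothetical mouse study\n"
)


def summarize_matches(handle):
    """Count matched datasets, distinct organisms, and matched files."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return {
        "datasets": len({row["dataset id"] for row in rows}),
        "organisms": sorted({row["dataset organisms"] for row in rows}),
        "files": sum(int(row["files count"]) for row in rows),
    }


summary = summarize_matches(io.StringIO(SAMPLE_TABLE))
print(summary)
```

Organism labels should still be checked against the dataset descriptions by hand, since a label can be misleading, as in the genipapo case.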
In this workflow, users need to generate the necessary files for the different in-silico tools. Here, we use an interesting peak, M196T420, as an example (Figure 4c). This peak is annotated as an unknown peak in KGMN and has 6 possible metabolite candidates.

This integration of KGMN and in-silico
First, generate the necessary files for M196T420.

Generate input files for your peak of interest.
This step is the same as for MetFrag. We use the peak M196T420 as an example.

Output of CFM-ID.
A folder "02_cfmid" will be created in the "M196T420" folder. It contains the CFM-ID results. The "cfmid_result.txt" file is the CFM-ID ranking result. The "cfmid_pred_spec.msp" file contains the predicted MS/MS spectra of the candidates.

Load required packages and set the working directory.
Repeat the procedures used for MetFrag and CFM-ID. Set the working directory to 07_insilico_msms, which is located in the KGMN result folder. Then, load the required packages.

Run MS-FINDER
We provide an R function (runMsFinderMatch) to call MS-FINDER. Here, we use the command tool

Output of MS-FINDER.
A folder "03_msfinder" will be created in the "M196T420" folder. It contains the MS-FINDER results.
The MS-FINDER results are organized by adduct type. The ranking result is located at 03_msfinder -> [M+H]+ -> result -> Structure result-2055.txt.

The script for connecting KGMN and in-silico MS/MS tools
Here is a script containing the code above to help connect KGMN and in-silico MS/MS tools quickly.